We consider a situation where the state of a system is represented by a real-valued vector $x \in \mathbb{R}^n$.
Under normal circumstances, the vector $x$ is zero, while an event manifests as non-zero entries in
$x$, possibly few. Our interest is in designing algorithms that can reliably detect events (i.e., test
whether $x = 0$ or $x \neq 0$) with the least amount of information. We place ourselves in a situation,
now common in the signal processing literature, where information on $x$ comes in the form of
noisy linear measurements $y = \langle a, x \rangle + z$, where $a \in \mathbb{R}^n$ has norm bounded by 1 and $z \sim \mathcal{N}(0, 1)$.
We derive information bounds in an active learning setup and exhibit some simple near-optimal
algorithms. In particular, our results show that the task of detection within this setting is at once
much easier, simpler and different than the tasks of estimation and support recovery.
1 Introduction

We consider a situation where the state of a system is represented by a real-valued vector $x \in \mathbb{R}^n$.
Under normal circumstances, the vector $x$ is zero, while an event manifests as non-zero entries in
$x$, possibly few. Our interest is in the design of algorithms that reliably detect events (i.e., test
whether $x = 0$ or $x \neq 0$) with the least amount of information. We assume that we may learn
about $x$ via noisy linear measurements of the form

\[ y_i = \langle a_i, x \rangle + z_i, \quad i = 1, \dots, m, \tag{1} \]

where the measurement vectors $a_i$ have Euclidean norm bounded by 1 and the noise variables $z_i$ are
i.i.d. standard normal. Assuming that we may take a limited number of linear measurements, the
engineering is in choosing them in order to minimize the false alarm and missed detection rates.
We derive information bounds, establishing some fundamental detection limits relating the signal
strength and the number of linear measurements. The bounds we obtain apply to all adaptive
schemes, where we may choose the $i$th measurement vector $a_i$ based on the past measurements,
i.e., we may choose $a_i$ as a function of $(a_1, y_1, \dots, a_{i-1}, y_{i-1})$.

1.1 Related work 

Learning as much as possible about a vector based on a few linear measurements is one of the central
themes of compressive sensing (CS). Most of this literature, as it relates to signal processing,
has focused on the tasks of estimation and support recovery. Particularly in surveillance situations,
however, it makes sense to perform detection before estimation because, as we shall confirm, reliable
detection is possible at much lower signal-to-noise ratios or, equivalently, with far fewer linear
measurements than estimation. This can be achieved with much greater implementation ease and
much lower computational cost than standard CS methods based on convex programming.

The literature on the detection of a high-dimensional signal is centered around the classical
normal mean model, based on observations $y_i = x_i + z_i$, where the $z_i$ are i.i.d. standard normal.
In this model, only one noisy observation is available per coordinate, so that some assumptions are
necessary and the most common one, by far, is that the vector $x = (x_1, \dots, x_n)$ is sparse. This
setting has attracted a fair amount of attention, with recent publications allowing adaptive
measurements [4]. More recently, a few papers [2, 14] extended these results to testing for a
sparse coefficient vector in a linear system with the aim of characterizing the detection feasibility.
These papers work with designs having low mutual coherence, for example, assuming that the $a_i$
are i.i.d. multivariate normal. As we shall see below, such designs are not always desirable. We also
mention [3], which assumes that an estimator $\hat{x}$ of $x$ is available and examines the performance of
the test based on $\langle \hat{x}, x \rangle$; and [17], which proposes a Bayesian approach for the detection of sparse
signals in a sensor network for which the design matrix is assumed to have some polynomial decay
in terms of the distance between sensors.

We mention that the present paper may be seen as a companion paper to [1], which considers
the tasks of estimation and support recovery in the same setting.



1.2 Notation and terminology 

Our detection problem translates into a hypothesis testing problem $H_0 : x = 0$ versus $H_1 : x \in \mathcal{X}$,
for some subset $\mathcal{X} \subset \mathbb{R}^n \setminus \{0\}$. A test procedure based on $m$ measurements of the form (1) is a
binary function of the data, i.e., $T = T(a_1, y_1, \dots, a_m, y_m)$, with $T = \varepsilon \in \{0, 1\}$ indicating that $T$
favors $H_\varepsilon$. The (worst-case) risk of a test $T$ is defined as

\[ \gamma(T) := \mathbb{P}_0(T = 1) + \max_{x \in \mathcal{X}} \mathbb{P}_x(T = 0), \]

where $\mathbb{P}_x$ denotes the distribution of the data when $x$ is the true underlying vector. With a prior
$\pi$ on the set of alternatives $\mathcal{X}$, the corresponding average (Bayes) risk is defined as

\[ \gamma_\pi(T) := \mathbb{P}_0(T = 1) + \mathbb{E}_\pi \mathbb{P}_x(T = 0), \]

where $\mathbb{E}_\pi$ denotes the expectation under $\pi$. Note that for any prior $\pi$ and any test procedure $T$,

\[ \gamma(T) \ge \gamma_\pi(T). \tag{2} \]

For a vector $a = (a_1, \dots, a_k)$,

\[ \|a\| = \Big( \sum_j a_j^2 \Big)^{1/2}, \qquad |a| = \sum_j |a_j|, \]

and $a^\top$ denotes its transpose. For a matrix $M$,

\[ \|M\|_{\rm op} = \sup_{a \neq 0} \frac{\|Ma\|}{\|a\|}. \]

Everywhere in the paper, $x = (x_1, \dots, x_n)$ denotes the unknown vector, while $\mathbf{1}$ denotes the vector
with all coordinates equal to 1 and dimension implicitly given by the context.



1.3 Content 

In Section 2 we focus on vectors $x$ with non-negative coordinates. This situation leads to an
exceedingly simple, yet near-optimal procedure based on a measurement scheme that is completely
at odds with what is commonly used in CS. In Section 3 we treat the case of a general vector
$x$ and derive another simple, near-optimal procedure. In both cases, the methods we suggest
are non-adaptive, in the sense that the measurement vectors are chosen independently of the
observations, yet perform nearly as well as any adaptive method. In Section 4 we discuss our
results and important extensions, particularly to the case of structured signals.






2 Vectors with non-negative entries 

Vectors with non-negative entries may be relevant in image processing, for example, where the
object to be detected is darker (or lighter) than the background. As we shall see, detecting such a
vector is essentially straightforward in every respect. In particular, the use of low-coherence designs
is counter-productive in this situation.

The first thing that comes to mind, perhaps, is gathering strength across coordinates by measuring
$x$ with the constant vector $\mathbf{1}/\sqrt{n}$. And, with a budget of $m$ measurements, we simply take
this measurement $m$ times.

Proposition 1. Suppose we take $m$ measurements of the form (1) with $a_i = \mathbf{1}/\sqrt{n}$ for all $i$.
Consider then the test that rejects when

\[ \sum_{i=1}^m y_i > \tau \sqrt{m}, \]

where $\tau$ is some critical value. Its risk against a vector $x$ is equal to

\[ 1 - \Phi(\tau) + \Phi(\tau - \sqrt{m/n}\,|x|), \]

where $\Phi$ is the standard normal distribution function. In particular, if $\tau = \tau_m \to \infty$, this test has
vanishing risk against alternatives satisfying $\sqrt{m/n}\,|x| - \tau_m \to \infty$.

Since we may choose $\tau_m \to \infty$ as slowly as we wish, in essence, the simple sum test based on
repeated measurements from the constant vector has vanishing risk against alternatives satisfying
$\sqrt{m/n}\,|x| \to \infty$.

Proof. The result is a simple consequence of the fact that

\[ \frac{1}{\sqrt{m}} \sum_{i=1}^m y_i \sim \mathcal{N}(\sqrt{m/n}\,|x|, 1). \]

□
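As an illustration, the following Python sketch evaluates the risk formula of Proposition 1 and compares it with a Monte Carlo estimate of the same quantity. The function names, the default number of trials and the seed are our own choices for this illustration and are not part of the original development.

```python
import math
import random
from typing import Sequence


def normal_cdf(t: float) -> float:
    """Standard normal distribution function Phi(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))


def sum_test_risk(l1_norm: float, n: int, m: int, tau: float) -> float:
    """Risk 1 - Phi(tau) + Phi(tau - sqrt(m/n) |x|) stated in Proposition 1."""
    shift = math.sqrt(m / n) * l1_norm
    return 1.0 - normal_cdf(tau) + normal_cdf(tau - shift)


def sum_test(x: Sequence[float], m: int, tau: float, rng: random.Random) -> bool:
    """Measure m times with a_i = 1/sqrt(n); reject when sum(y_i) > tau * sqrt(m)."""
    n = len(x)
    signal = sum(x) / math.sqrt(n)  # <1/sqrt(n), x>, equal to |x|/sqrt(n) when x >= 0
    total = sum(signal + rng.gauss(0.0, 1.0) for _ in range(m))
    return total > tau * math.sqrt(m)


def monte_carlo_risk(x: Sequence[float], m: int, tau: float,
                     trials: int = 5000, seed: int = 0) -> float:
    """Empirical false alarm rate plus empirical miss rate against x."""
    rng = random.Random(seed)
    zero = [0.0] * len(x)
    false_alarms = sum(sum_test(zero, m, tau, rng) for _ in range(trials))
    misses = sum(not sum_test(x, m, tau, rng) for _ in range(trials))
    return (false_alarms + misses) / trials
```

For a non-negative vector `x` of length 100 whose entries sum to 2, the values `sum_test_risk(2.0, 100, 25, 0.5)` and `monte_carlo_risk(x, 25, 0.5)` should agree up to simulation error.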

Although the choice of measurement vectors and the test itself are both exceedingly simple, 
the resulting procedure comes close to achieving the best possible performance in this particular 
setting, as the following information bound reveals. 

Theorem 1. Let $\mathcal{X}(\mu, S)$ denote the set of vectors in $\mathbb{R}^n$ having exactly $S$ non-zero entries all equal
to $\mu > 0$. Based on $m$ measurements of the form (1), possibly adaptive, any test for $H_0 : x = 0$
versus $H_1 : x \in \mathcal{X}(\mu, S)$ has risk at least $1 - \sqrt{m/(8n)}\, S\mu$.

In particular, the risk against alternatives $x \in \mathcal{X}(\mu, S)$ with $\sqrt{m/n}\,|x| = \sqrt{m/n}\, S\mu \to 0$ goes
to 1 uniformly over all procedures.

Proof. The standard approach to deriving uniform lower bounds on the risk is to put a prior on the
set of alternatives and use (2). We simply choose the uniform prior on $\mathcal{X}(\mu, S)$, which we denote
by $\pi$. The hypothesis testing problem reduces to $H_0 : x = 0$ versus $H_1 : x \sim \pi$, for which the
likelihood ratio test is optimal by the Neyman-Pearson fundamental lemma. The likelihood ratio
is defined as

\[ L := \frac{\mathbb{P}_\pi(a_1, y_1, \dots, a_m, y_m)}{\mathbb{P}_0(a_1, y_1, \dots, a_m, y_m)} = \mathbb{E}_\pi \exp\Big( \sum_{i=1}^m \big[ y_i (a_i^\top x) - (a_i^\top x)^2/2 \big] \Big), \]

where $\mathbb{E}_\pi$ denotes the expectation with respect to $\pi$, and the related test is $T = \{L > 1\}$. It has
risk equal to

\[ \gamma_\pi(T) = 1 - \tfrac{1}{2} \|\mathbb{P}_\pi - \mathbb{P}_0\|_{\rm TV}, \tag{3} \]

where $\mathbb{P}_\pi := \mathbb{E}_\pi \mathbb{P}_x$ (the $\pi$-mixture of $\mathbb{P}_x$) and $\|\cdot\|_{\rm TV}$ is the total variation distance. By Pinsker's
inequality,

\[ \|\mathbb{P}_\pi - \mathbb{P}_0\|_{\rm TV} \le \sqrt{K(\mathbb{P}_0, \mathbb{P}_\pi)/2}, \tag{4} \]

where $K(\mathbb{P}_0, \mathbb{P}_\pi)$ denotes the Kullback-Leibler divergence. We have

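\[
\begin{aligned}
K(\mathbb{P}_0, \mathbb{P}_\pi) &= -\mathbb{E}_0 \log L && (5) \\
&\le -\mathbb{E}_\pi \sum_{i=1}^m \mathbb{E}_0 \big( y_i (a_i^\top x) - (a_i^\top x)^2/2 \big) && (6) \\
&= \mathbb{E}_\pi \sum_{i=1}^m \mathbb{E}_0 (a_i^\top x)^2/2 && (7) \\
&\le \sum_{i=1}^m \mathbb{E}_0 (a_i^\top C a_i) && (8) \\
&\le m \|C\|_{\rm op}, && (9)
\end{aligned}
\]

where $C = (c_{jk}) := \mathbb{E}_\pi(x x^\top)$. The first line is by definition; the second is by definition of $\mathbb{P}_x/\mathbb{P}_0$,
by the application of Jensen's inequality justified by the convexity of $x \mapsto -\log x$, and by Fubini's
theorem; the third is by independence of $a_i$, $y_i$ and $x$ (under $\mathbb{P}_0$), and by the fact that $\mathbb{E}(y_i) = 0$;
the fourth is by independence of $a_i$ and $x$ (under $\mathbb{P}_0$) and by Fubini's theorem; the fifth is because
$\|a_i\| \le 1$ for all $i$.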

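Since under $\pi$ the support of $x$ is chosen uniformly at random among subsets of size $S$, we have

\[ c_{jj} = \mu^2\, \mathbb{P}_\pi(x_j \neq 0) = \mu^2 \cdot \frac{S}{n}, \quad \forall j, \]

and

\[ c_{jk} = \mu^2\, \mathbb{P}_\pi(x_j \neq 0, x_k \neq 0) = \mu^2 \cdot \frac{S}{n} \cdot \frac{S-1}{n-1}, \quad j \neq k. \]

This simple matrix has operator norm $\|C\|_{\rm op} = \mu^2 S^2/n$.
Coming back to the divergence, we therefore have

\[ K(\mathbb{P}_0, \mathbb{P}_\pi) \le m \cdot \mu^2 S^2/n, \]

and returning to (3) via (4), we bound the risk of the likelihood ratio test as follows:

\[ \gamma(T) \ge 1 - \sqrt{K(\mathbb{P}_0, \mathbb{P}_\pi)/8} \ge 1 - \sqrt{m/(8n)}\, S\mu. \]

□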
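The value of the operator norm used in this proof can be checked numerically. The sketch below builds the matrix $C$ for small values of $n$, $S$ and $\mu$ and estimates its largest eigenvalue by power iteration; the choice of power iteration and all function names are ours, made only for illustration.

```python
import math


def prior_second_moment(n: int, S: int, mu: float) -> list:
    """C = E_pi(x x^T) for the uniform prior on S-sparse vectors with entries mu."""
    diag = mu ** 2 * S / n
    off = mu ** 2 * (S / n) * ((S - 1) / (n - 1))
    return [[diag if j == k else off for k in range(n)] for j in range(n)]


def operator_norm(C: list, iterations: int = 100) -> float:
    """Largest eigenvalue of a symmetric positive semidefinite matrix (power iteration)."""
    n = len(C)
    v = [1.0 + 0.01 * j for j in range(n)]  # generic, non-degenerate starting vector
    for _ in range(iterations):
        w = [sum(C[j][k] * v[k] for k in range(n)) for j in range(n)]
        norm = math.sqrt(sum(t * t for t in w))
        v = [t / norm for t in w]
    Cv = [sum(C[j][k] * v[k] for k in range(n)) for j in range(n)]
    return sum(v[j] * Cv[j] for j in range(n))  # Rayleigh quotient, since ||v|| = 1


def risk_lower_bound(n: int, S: int, mu: float, m: float) -> float:
    """Bound 1 - sqrt(m/(8n)) S mu of Theorem 1 (vacuous when negative)."""
    return 1.0 - math.sqrt(m / (8.0 * n)) * S * mu
```

With, say, `n = 30`, `S = 5` and `mu = 0.2`, the output of `operator_norm(prior_second_moment(30, 5, 0.2))` should match `mu**2 * S**2 / n` up to rounding.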

With Proposition 1 and Theorem 1, we conclude that the following is true in a minimax sense:

Reliable detection of a nonnegative vector $x \in \mathbb{R}^n$ from $m$ noisy linear measurements is
possible if $\sqrt{m/n}\,|x| \to \infty$ and impossible if $\sqrt{m/n}\,|x| \to 0$.
3 General vectors

When dealing with arbitrary vectors, the measurement vector $\mathbf{1}/\sqrt{n}$ may not be appropriate. In
fact, the resulting procedure is completely insensitive to vectors $x$ such that $\langle \mathbf{1}, x \rangle = 0$. Nevertheless,
if one selects a measurement vector $a$ from the Bernoulli ensemble (i.e., with independent
entries taking values $\pm 1/\sqrt{n}$ with equal probability), then on average, $\langle a, x \rangle$ is of the order of
$\|x\|/\sqrt{n}$. This is true when the number of non-zero entries in $x$ grows with the dimension $n$; if we
repeat the process a few times, it becomes true for any fixed vector $x$.

Proposition 2. Sample $b_1, \dots, b_{h_m}$ independently from the Bernoulli ensemble, with $h_m \to \infty$
slowly, and take $m$ measurements of the form (1) with $a_i = b_s$ for $i \in I_s := [(m/h_m)(s - 1) + 1, (m/h_m)s]$,
$s = 1, \dots, h_m$. Consider then the test that rejects when

\[ \sum_{s=1}^{h_m} \Big( \sum_{i \in I_s} y_i \Big)^2 > m (1 + \tau_m/\sqrt{h_m}), \tag{10} \]

where $\tau_m \to \infty$. When $m \to \infty$, its risk against a vector $x$, averaged over the Bernoulli ensemble,
vanishes if $(m/n)\|x\|^2 \ge 2 \tau_m \sqrt{h_m}$.

Since we may take $h_m$ and $\tau_m$ increasing as slowly as we please, in essence, the test is reliable
when $(m/n)\|x\|^2 \to \infty$. Compared with repeatedly measuring with the constant vector $\mathbf{1}/\sqrt{n}$ as
studied in Proposition 1, there is a substantial loss in power when $|x|^2$ is much larger than $\|x\|^2$.
For example, when $x$ has $S$ non-zero entries all equal to $\mu > 0$, $|x|^2 = S\|x\|^2$.

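Proof. For simplicity, assume that $m/h_m$ is an integer and fix $x$ throughout. For short, let

\[ Y_s = \sum_{i \in I_s} y_i = (m/h_m) \langle b_s, x \rangle + \sqrt{m/h_m}\, Z_s, \qquad Z_s := \sqrt{h_m/m} \sum_{i \in I_s} z_i. \]

Note that the $Z_s$ are i.i.d. $\mathcal{N}(0, 1)$, while the $\langle b_s, x \rangle$ are i.i.d. with mean zero, variance $\|x\|^2/n$
and fourth moment bounded by $6\|x\|^4/n^2$, which is immediate using the fact that the coordinates
of $b_s$ are i.i.d. taking values $\pm 1/\sqrt{n}$ with equal probability. Proceeding in an elementary way, we
have

\[ \mathbb{E}\Big( \sum_s Y_s^2 \Big) = (m^2/h_m)\, \mathbb{E}(\langle b_1, x \rangle^2) + m\, \mathbb{E}(Z_1^2) = (m^2/h_m) \|x\|^2/n + m, \]

and

\[
\begin{aligned}
\mathrm{Var}\Big( \sum_s Y_s^2 \Big) &\le (m^4/h_m^3)\, \mathbb{E}(\langle b_1, x \rangle^4) + 6 (m^3/h_m^2)\, \mathbb{E}(\langle b_1, x \rangle^2)\, \mathbb{E}(Z_1^2) + (m^2/h_m)\, \mathbb{E}(Z_1^4) \\
&\le 6 (m^4/h_m^3) \|x\|^4/n^2 + 6 (m^3/h_m^2) \|x\|^2/n + 3 (m^2/h_m).
\end{aligned}
\]

Therefore, by Chebyshev's inequality, the probability of (10) under the null is bounded from above
by $3/\tau_m^2 \to 0$. Similarly, the probability of (10) not happening under an alternative $x$ satisfying
$(m/n)\|x\|^2 \ge 2 \tau_m \sqrt{h_m}$ is bounded from above by

\[ \frac{6 (m^4/h_m^3) \|x\|^4/n^2 + 6 (m^3/h_m^2) \|x\|^2/n + 3 (m^2/h_m)}{\big( (m^2/h_m) \|x\|^2/n - m \tau_m/\sqrt{h_m} \big)^2} \le \frac{24}{h_m} + \frac{24}{\tau_m \sqrt{h_m}} + \frac{3}{\tau_m^2} \to 0. \]

□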
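The procedure of Proposition 2 is simple enough to simulate directly. The following Python sketch draws the Bernoulli vectors, forms the block sums and applies the rejection rule (10); it assumes that the number of blocks divides $m$, and the function names, trial count and seed are illustrative choices of ours.

```python
import math
import random
from typing import Sequence


def bernoulli_block_test(x: Sequence[float], m: int, h: int, tau: float,
                         rng: random.Random) -> bool:
    """One run of the test (10) with h blocks; assumes h divides m."""
    n = len(x)
    block = m // h  # number of measurements sharing the same vector b_s
    stat = 0.0
    for _ in range(h):
        # <b_s, x> for b_s with i.i.d. entries +1/sqrt(n) or -1/sqrt(n)
        inner = sum(xj if rng.random() < 0.5 else -xj for xj in x) / math.sqrt(n)
        # Y_s: sum over the block of the measurements y_i = <b_s, x> + z_i
        y_block = sum(inner + rng.gauss(0.0, 1.0) for _ in range(block))
        stat += y_block ** 2
    return stat > m * (1.0 + tau / math.sqrt(h))


def rejection_rate(x: Sequence[float], m: int, h: int, tau: float,
                   trials: int = 200, seed: int = 0) -> float:
    """Fraction of independent runs (fresh design and noise) in which the test rejects."""
    rng = random.Random(seed)
    return sum(bernoulli_block_test(x, m, h, tau, rng) for _ in range(trials)) / trials
```

A vector such as `[3.0, -3.0] * 10` is orthogonal to the constant vector, so the sum test of Proposition 1 is blind to it, whereas `rejection_rate` for this vector should be close to 1 once $(m/n)\|x\|^2$ is well above $2\tau_m\sqrt{h_m}$.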
Again, this relatively simple procedure nearly achieves the best possible performance.

Theorem 2. Let $\mathcal{X}^{\pm}(\mu, S)$ denote the set of vectors in $\mathbb{R}^n$ having exactly $S$ non-zero entries all
equal to $\pm\mu$. Based on $m$ measurements of the form (1), possibly adaptive, any test for $H_0 : x = 0$
versus $H_1 : x \in \mathcal{X}^{\pm}(\mu, S)$ has risk at least $1 - \sqrt{Sm/(8n)}\,\mu$.

In particular, the risk against alternatives $x \in \mathcal{X}^{\pm}(\mu, S)$ with $(m/n)\|x\|^2 = (m/n) S \mu^2 \to 0$
goes to 1 uniformly over all procedures.

Proof. Again, we choose the uniform prior on $\mathcal{X}^{\pm}(\mu, S)$. The proof is then completely parallel to
that of Theorem 1, now with $C = \mu^2 (S/n) I$ (since the signs of the nonzero entries of $x$ are
i.i.d. Rademacher), so that $\|C\|_{\rm op} = \mu^2 S/n$. □

With Proposition 2 and Theorem 2, we conclude that the following is true in a minimax sense:

Reliable detection of a vector $x \in \mathbb{R}^n$ from $m$ noisy linear measurements is possible if
$\sqrt{m/n}\,\|x\| \to \infty$ and impossible if $\sqrt{m/n}\,\|x\| \to 0$.

4 Discussion 

In this short paper, we tried to convey some very basic principles about detecting a high-dimensional 
vector with as few linear measurements as possible. First, when the vector has non-negative entries, 
repeatedly sampling from the constant vector $\mathbf{1}/\sqrt{n}$ is near-optimal. Second, when the vector is
general but sparse, repeatedly sampling from a few measuring vectors drawn from a standard 
random (e.g., Bernoulli) ensemble is also near-optimal. In both cases, choosing the measuring 
vectors adaptively does not bring a substantial improvement. And, moreover, sparsity does not 
help, in the sense that the detection rates depend on |x| and ||x||, respectively. 

4.1 A more general adaptive scheme 

Suppose we may take as many linear measurements of the form (1) as we please (possibly an infinite
number), with the only constraint being on the total measurement energy

\[ \sum_i \|a_i\|^2 \le m. \tag{11} \]

(Note that $m$ is no longer constrained to be an integer.) This is essentially the setting considered
in [1], and clearly, the setup we studied in the previous sections satisfies this condition. So
what can we achieve with this additional flexibility?

In fact, the same results apply. The lower bounds in Theorem 1 and Theorem 2 are proved in
exactly the same way. (We effectively use (11) to go from (8) to (9), and this is the only place
where the constraints on the number and norm of the measurement vectors are used.) Of course,
Proposition 1 and Proposition 2 apply since the measurement schemes used there satisfy (11).
However, in this special case they could be simplified. For instance, in Proposition 1 we could take
one measurement with the constant vector $\sqrt{m/n}\,\mathbf{1}$.
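To make the last remark concrete, the Python sketch below compares the repeated design of Proposition 1 with the single measurement vector $\sqrt{m/n}\,\mathbf{1}$: both use total energy $m$ and both shift the normalized statistic by $\sqrt{m/n}\,|x|$ for a non-negative $x$. The helper names are hypothetical and serve only this illustration.

```python
import math
from typing import Sequence


def measurement_energy(design: Sequence[Sequence[float]]) -> float:
    """Total energy sum_i ||a_i||^2 appearing in (11)."""
    return sum(sum(c * c for c in a) for a in design)


def repeated_design(n: int, m: int) -> list:
    """m copies of the vector 1/sqrt(n), as in Proposition 1 (m an integer)."""
    return [[1.0 / math.sqrt(n)] * n for _ in range(m)]


def single_shot_design(n: int, m: float) -> list:
    """A single vector sqrt(m/n) * 1; here m may be any positive real number."""
    return [[math.sqrt(m / n)] * n]


def mean_shift(design: Sequence[Sequence[float]], x: Sequence[float]) -> float:
    """Mean of sum_i y_i divided by its noise standard deviation sqrt(#measurements)."""
    total = sum(sum(c * xj for c, xj in zip(a, x)) for a in design)
    return total / math.sqrt(len(design))
```

Since the noise in each measurement has unit variance, a test thresholding either statistic at the same critical value has the same risk.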






4.2 Detecting structured signals 

The results we derived are tailored to the case where $x$ has no known structure. What if we know
a priori that the signal $x$ has some given structure? The most emblematic case is when the support
of $x$ is an interval of length $S$. In the classical setting where each coordinate of $x$ is observed once,
the scan statistic (aka generalized likelihood ratio test) is the tool of choice [3]. How does the story
change in the setting where adaptive linear measurements in the form of (1) can be taken?

Perhaps surprisingly, knowing that $x$ has such a specific structure does not help much. Indeed,
Theorem 1 and Theorem 2 are proved in the same way. In the case of non-negative vectors,
we use the uniform prior on vectors with support an interval of length $S$ and nonzero entries
all equal to $\mu$, and the proof is identical, except for the matrix $C$, which now has coefficients
$c_{jk} = \mu^2 \max(S - |j - k|, 0)/n$ for all $j, k$. Because $C$ is symmetric, we have

\[ \|C\|_{\rm op} \le \max_j \sum_k |c_{jk}| = \mu^2 S^2/n, \tag{12} \]

which is exactly the same bound as before. In the general case, the arguments are really identical,
except that we use the uniform prior on vectors with support an interval of length $S$ and nonzero entries
all equal to $\mu$ in absolute value. (Here the matrix $C$ is exactly the same.) Of course, Proposition 1
and Proposition 2 apply here too, so the conclusions are the same. Here too, these conclusions hold
in the more general setup with measurements satisfying (11).

To appreciate how powerful the ability to take linear measurements in the form of (1) with the
constraint (11) really is, let us stay with the same task of detecting an interval of length $S$ with a
positive mean. On the one hand, we have the simple test based on $\sum_i y_i$ studied in Proposition 1.
On the other hand, we have the scan statistic

\[ \max_t \sum_{i=t}^{t+S-1} y_i, \]

with observations of the form

\[ y_i = x_i + \sigma z_i, \qquad \sigma := \sqrt{n/m}. \tag{13} \]

While the former requires $\sqrt{m/n}\,|x| \to \infty$ to be asymptotically powerful, the scan statistic requires
the quantity

\[ \sqrt{m/n}\,|x| \cdot (S \log^+(n/S))^{-1/2} \]

to remain above a positive constant in the limit, where $\log^+(x) := \max(\log x, 1)$. With observations
provided in the form of (13), this is asymptotically optimal. Note that (13) is a special case of (11).
Hence, the ability to take measurements of the form (1) allows one to detect structured signals that
are potentially much weaker, without a priori knowledge of the structure and with much simpler
algorithms. Hardware that is able to take linear measurements such as (1) is currently being
developed [8].
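For concreteness, here is a small Python sketch of the scan statistic together with the two signal quantities being compared, for an interval of length $S$ with entries equal to $\mu$. The function names are ours, the sliding-window implementation is one of several possible choices, and the sketch assumes the observation vector has at least $S$ entries.

```python
import math
from typing import Sequence


def log_plus(t: float) -> float:
    """log^+(t) := max(log t, 1)."""
    return max(math.log(t), 1.0)


def scan_statistic(y: Sequence[float], S: int) -> float:
    """Maximum over t of y_t + ... + y_{t+S-1}, computed with a sliding window sum."""
    window = sum(y[:S])
    best = window
    for t in range(S, len(y)):
        window += y[t] - y[t - S]  # slide the window one position to the right
        best = max(best, window)
    return best


def sum_test_signal(n: int, m: float, S: int, mu: float) -> float:
    """sqrt(m/n) |x| for an interval of length S with entries mu."""
    return math.sqrt(m / n) * S * mu


def scan_test_signal(n: int, m: float, S: int, mu: float) -> float:
    """sqrt(m/n) |x| (S log^+(n/S))^(-1/2), the quantity governing the scan statistic."""
    return sum_test_signal(n, m, S, mu) / math.sqrt(S * log_plus(n / S))
```

The ratio of the two quantities is $\sqrt{S \log^+(n/S)}$, which quantifies how much weaker a signal the sum test can handle in this comparison.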



4.3 A comparison with estimation and support recovery 

The results we obtain for detection are in sharp contrast with the corresponding results in estimation
and support recovery. Though, by definition, detection is always easier, in most other settings it
is not that much easier. For example, take the normal mean model described in the Introduction,
assuming $x$ is sparse with $S$ coefficients equal to $\mu > 0$. In the regime where $S = n^{1-\beta}$, $\beta \in (1/2, 1)$,
detection is impossible when $\mu \le \sqrt{2r \log n}$ with $r < \rho_1(\beta)$, while support recovery is possible when
$\mu \ge \sqrt{2r \log n}$ with $r > \rho_2(\beta)$, for fixed functions $\rho_1, \rho_2 : (1/2, 1) \to (0, \infty)$ [7, 15, 16]. So the
difference is a constant factor in the per-coordinate amplitude. In the setting we consider here, we
are able to detect at a much smaller signal-to-noise ratio than what is required for estimation or
support recovery, which nominally require at least $m \ge S$ measurements regardless of the signal
amplitude, where $S$ is the number of nonzero entries in $x$. In fact, [1] shows that reliable support
recovery is impossible unless $\mu$ is of order at least $\sqrt{n/m}$. In detection, however, we saw that
$m = 1$ measurement may suffice if the signal amplitude is large enough, which can be smaller
than $\sqrt{n/m}$ by a factor of $S$ or $\sqrt{S}$ in the nonnegative and general cases respectively. Therefore,
having the ability to take linear measurements of the form (1) in a surveillance setting, it makes
sense to perform detection as described here before estimation (identification) or support recovery
(localization) of the signal.
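The scalings in this comparison can be tabulated with a few lines of Python. This is only an order-of-magnitude sketch: constants and logarithmic factors are ignored, and the function name and dictionary keys are ours.

```python
import math


def amplitude_scales(n: int, m: float, S: int) -> dict:
    """Orders of magnitude of the per-coordinate amplitude mu discussed above.

    Constants and log factors are ignored; only the scalings in n, m and S are kept.
    """
    base = math.sqrt(n / m)
    return {
        # support recovery: mu of order at least sqrt(n/m)
        "support_recovery": base,
        # general case: sqrt(m/n) ||x|| = sqrt(m/n) sqrt(S) mu must be large
        "detection_general": base / math.sqrt(S),
        # non-negative case: sqrt(m/n) |x| = sqrt(m/n) S mu must be large
        "detection_nonnegative": base / S,
    }
```

For example, `amplitude_scales(10000, 100, 25)` places the three scales a factor $\sqrt{S} = 5$ apart from one another.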

4.4 Possible improvements 

Though we provided simple algorithms that nearly match information bounds, there might be 
room for improvement. For one thing, it might be possible to reliably detect when, say, $\sqrt{m/n}\,|x|$
is sufficiently large (for the case where $x_j \ge 0$ for all $j$) without necessarily tending to infinity.
A good candidate for this might be the Bayesian algorithm proposed in 

More importantly, in the general case of Section 3, we might want to design an algorithm that
detects any fixed $x$ with high probability, without averaging over the measurement design. This
averaging may be interpreted in at least two ways: 

(A1) If we were to repeat the experiment many times, each time choosing new measurement vectors
and corrupting the measurements with new noise, then for a fixed vector $x$, in most instances
the test would be accurate.

(A2) Given the amplitudes $|x_j|$, $j = 1, \dots, n$, for most sign configurations the test will be accurate.

Interpretation (A1) is controversial as we do not repeat the experiment, which would amount to
taking more samples. And interpretation (A2) raises the issue of robustness to any sign
configuration. One way (and the only way we know of) to ensure this robustness is to use a CS-like
sampling scheme, i.e., choosing $a_1, \dots, a_m$ in (1) such that the matrix with these rows satisfies
RIP-like properties. This setting has been studied in detail; in a nutshell, the findings are as follows. Take
measurement vectors from the Bernoulli ensemble, say, but hold the measurement design fixed.
This is just a way to build a measurement matrix satisfying the RIP and with low mutual
coherence. In particular, this requires that $m$ is of order at least $S \log n$, though what follows assumes
that $m$ is larger than $S$ by a higher power of $\log n$. Based on such measurements, the test based on
$\sum_i y_i^2$ is able to detect when $(\sqrt{m}/n)\|x\|^2 \to \infty$, which is more stringent than what is required in
Proposition 2, while the test based on $\max_{j=1,\dots,n} |A_j^\top y|$, with $A_j$ the $j$th column of the
measurement matrix, is able to detect when $\liminf \sqrt{m/n}\, \max_j |x_j| (\log n)^{-1/2} > \sqrt{2}$,
which, except for the log factor, is what is required for support recovery. And this is essentially
optimal, as shown in [3].